
5 Genomics and Epigenomics Analysis
5.1 On this page
Biological insights and take-home messages are at the bottom of the page at Lesson Learnt: Section 5.5.
- Here we investigate all the genomics and epigenomics data available for the three kidney carcinomas.
- First, we analyse the gene Copy Number Variants (CNVs).
- We test the distribution of CNVs across the three kidney carcinomas and investigate any signature associated with them.
- We perform some exploratory analyses on samples, CNVs and clinical covariates.
- Then, we look into the somatic mutations detected across the three kidney carcinomas.
- We test the distribution of somatic mutations across the three kidney carcinomas and investigate any signature associated with them.
- We perform some exploratory analyses on samples, somatic mutations and clinical covariates.
- Finally, we focus on DNA methylation and epigenomics signals.
- We perform QC on probe intensities (corresponding to the different methylation levels of CpG islands) and on their distribution across samples, and we impute any missing values.
- We do some exploratory analyses on samples, probe intensities and clinical covariates.
- We then run a formal differential methylation analysis to identify probes that have different methylation levels across the three kidney cancer types, and the genes associated with them.
- We perform Gene Set Enrichment Analyses on the genes associated with differentially methylated CpG islands to investigate biological and molecular themes that discriminate between the three kidney cancer types.
5.2 Copy Number Variants (CNVs)
5.2.1 Filtering & QC
The copy number variants are available from TCGA as information collapsed at the gene level. CNV information is reported for 24,776 genes. For each gene, we have an integer value ranging from -2 (complete deletion of the genomic region) to +2 (complete duplication of both alleles), with 0 indicating no CNV event for that gene.
CNV information is available for 877 kidney carcinoma samples and is missing for only 10 samples. Each of the 24,776 reported genes is affected by a CNV event in at least one of these 877 samples.
Patients affected by KICH seem to have a significantly higher number of genes affected by CNVs, followed by KIRP patients. KIRC patients have the fewest genes affected by CNVs. In KICH and KIRP patients, moreover, CNVs involving gene duplications are more abundant than those involving gene deletions.
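As a minimal sketch of how the per-sample counts discussed above could be tallied, the snippet below counts altered, gained and lost genes from a vector of gene-level scores in the -2 to +2 range. The function name `cnv_burden` and the toy values are illustrative assumptions, not part of the original analysis or TCGA data.

```python
from collections import Counter


def cnv_burden(scores):
    """Count altered, gained and lost genes for one sample.

    `scores` is a list of integer gene-level CNV scores in [-2, 2],
    where 0 means no CNV event for that gene.
    """
    counts = Counter(scores)
    gained = counts[1] + counts[2]    # positive scores: duplications
    lost = counts[-1] + counts[-2]    # negative scores: deletions
    return {"altered": gained + lost, "gained": gained, "lost": lost}


# Toy sample with six genes (made-up values, not TCGA data).
toy_sample = [0, 1, 2, -1, 0, 1]
print(cnv_burden(toy_sample))
```

Applying the same function to every sample and grouping by cancer type would give the kind of comparison described in the paragraph above.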
5.2.2 CNVs signatures across Kidney cancers
Let’s now investigate the presence of any CNV signature across the three kidney carcinomas. As mentioned before, the TCGA CNV data are reported at the gene level, but Whole Genome Duplication or Chromosomal Aberration events are usually larger than a single gene and span several genes. Since we do not have access to the raw genomics data, a first naive approach would be to walk each chromosome and bridge consecutive genes that share the same TCGA-reported CNV score into a single CNV event. This would collapse the information of 24,776 different segments (the reported genes) into tens or hundreds of long, consecutive genomic ranges. The approach would work well under the assumption that large CNV events are much more likely than short, isolated CNV events spanning just one or a few genes. However, the available CNV data are limited to the gene level (~1-2% of the total human genome length), which means that we have no direct information on CNV events (or lack thereof) over about 98% of the genome, including the regions between consecutive genes. We have therefore decided not to bridge consecutive genes with the same CNV score into single CNV events, and to keep the downstream CNV analyses at the gene level.
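For illustration only, the naive bridging approach described above (which this analysis chose not to apply) could be sketched as follows. The tuple layout `(start, end, score)`, the function name `bridge_segments` and the toy coordinates are assumptions made for this example.

```python
from itertools import groupby


def bridge_segments(genes):
    """Merge runs of consecutive genes that share the same CNV score.

    `genes` is a list of (start, end, score) tuples for ONE chromosome,
    already sorted by start position.
    Returns a list of (start, end, score, n_genes) segments.
    """
    segments = []
    for score, run in groupby(genes, key=lambda gene: gene[2]):
        run = list(run)
        start = run[0][0]
        end = max(gene[1] for gene in run)  # tolerate overlapping genes
        segments.append((start, end, score, len(run)))
    return segments


# Toy chromosome with five genes (hypothetical coordinates and scores).
toy_chromosome = [
    (100, 200, 1),
    (300, 450, 1),
    (500, 600, 0),
    (700, 900, -1),
    (950, 990, -1),
]
for segment in bridge_segments(toy_chromosome):
    print(segment)
```

Note that each bridged segment silently assumes the intergenic gaps share the score of the flanking genes, which is exactly the assumption the gene-level data cannot verify.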

As seen in Figure 1, biopsies from KICH patients show extensive CNV events that seem to span most of the coding genome. CNV events seem to follow similar patterns across the different kidney carcinomas, and they appear to span full-length chromosomes. Within each kidney carcinoma type, it seems that biopsies could be clustered into different subtypes depending on their CNV signatures.
In the next steps, we will try to correlate these CNV signatures with other clinical covariates.
5.2.3 Dimensionality Reduction and Dataset Exploration
5.2.3.1 UMAP on CNVs data
Let’s check how the samples cluster on a UMAP. For transcriptomics data (see Section 2.3.1) we have drawn the UMAP based on the top 1,000 most variable genes. Given the nature (and distribution) of the underlying CNV data, we have decided to draw the UMAP based on the top 1,000, 2,000, 3,000, 4,000, 5,000 and 10,000 most variable CNV-affected genes.
Increasing the number of genes used to compute the UMAP improved the separation between samples of the different kidney carcinomas. A cluster of overlapping biopsies from the three kidney carcinomas is visible on the right-hand side for N=1000, N=2000, N=3000 and N=10000, and on the left-hand side for N=5000, suggesting that, for a subset of CNV subtypes, CNVs discriminate poorly between KIRC, KIRP and KICH.
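The feature-selection step that precedes the UMAP (keeping the top N most variable genes) can be sketched as below. This is a simplified illustration under stated assumptions: the gene names and values are made up, the function name `top_variable_genes` is hypothetical, and the UMAP embedding itself is not reproduced here.

```python
from statistics import pvariance


def top_variable_genes(matrix, n):
    """Return the names of the `n` genes with the highest variance across samples.

    `matrix` maps a gene name to its list of per-sample values.
    """
    ranked = sorted(matrix, key=lambda gene: pvariance(matrix[gene]), reverse=True)
    return ranked[:n]


# Toy matrix: three genes measured in four samples (made-up CNV scores).
toy_matrix = {
    "GENE_A": [0, 0, 0, 0],     # no variation at all
    "GENE_B": [-2, 2, -2, 2],   # highly variable
    "GENE_C": [0, 1, 0, 1],     # mildly variable
}
print(top_variable_genes(toy_matrix, 2))
```

Repeating the selection with N = 1,000 up to 10,000 and embedding each subset separately would mirror the comparison described above.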

5.2.3.2 PCA
As we did for the other omics data, the next step in the dataset exploration is to perform the Principal Component Analysis.

The first 18 Principal Components capture more than 80% of the variance in the Kidney cancers CNVs dataset, with the first two components (PC1 and PC2) capturing a bit more than 33% of the variance.
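A statement such as "the first k components capture more than 80% of the variance" comes from the cumulative share of variance explained. The sketch below shows that calculation on made-up per-component variances; the function name `components_for_variance` and the toy numbers are assumptions, not values from this dataset.

```python
def components_for_variance(variances, threshold=0.80):
    """Smallest number of leading components whose cumulative
    share of the total variance reaches `threshold`."""
    total = sum(variances)
    cumulative = 0.0
    for k, variance in enumerate(variances, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return k
    return len(variances)


# Toy per-component variances, sorted from PC1 downwards (made-up values).
toy_variances = [6.0, 2.5, 1.0, 0.5]
print(components_for_variance(toy_variances))        # PCs needed for 80%
print(components_for_variance(toy_variances, 0.5))   # PCs needed for 50%
```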
When we project the samples onto PC1 and PC2, we can see that PC1 separates KIRC from KICH and KIRP, which instead cluster together. As observed in the UMAP, the clustering is noisier than what we observed for the other omics analyses (transcriptomics, proteomics and micro-RNAs).

We can also investigate other Principal Components, to see if there is a component that manages to fully resolve the three cancer types. The combination of PC1 and PC4 seems to best separate samples from KIRC, KIRP and KICH.

Let’s check the Pearson correlation with other clinical covariates.
Cancer_type correlates well with PC1 and PC4. Most of the TCGA-defined molecular subtypes correlate with PC1, PC2 and PC4. Interestingly, the TCGA Subtype_CNA, which should be based on the CNV data, correlates with PC3 and PC4, but not with PC1. PC3 is interesting since it correlates with follow_up_tumor_status, pathologic_t and tumor_stage, suggesting a link between CNV signatures and cancer progression that is common to all kidney carcinomas and not limited to KIRC, KIRP or KICH.
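The correlation between a Principal Component and a covariate can be sketched with a plain Pearson coefficient, as below. This assumes the covariate has already been encoded numerically (for example, tumor stage as 1 to 4); how categorical covariates such as Cancer_type were encoded in the original analysis is not stated, and the toy values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


# Toy example: sample scores on one component vs. a numeric stage (made-up).
toy_pc_scores = [-1.2, -0.4, 0.1, 0.9, 1.5]
toy_stage = [1, 1, 2, 3, 4]
print(round(pearson(toy_pc_scores, toy_stage), 3))
```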

5.3 Somatic mutations
Let’s now dig into the somatic mutations detected in the biopsies of the kidney carcinoma patients. We have somatic mutation information for 711 samples (~80% of the 887 patients in the cohort). TCGA reports somatic mutations as a binary value for each gene: 1 if the patient has a non-silent mutation in that gene, 0 otherwise. We could retrieve information for 40,542 genes, of which 13,584 had a non-silent mutation in at least one of our 711 kidney carcinoma patients. The first step was therefore to discard from downstream analysis the 26,958 genes that appeared unmutated in the kidney carcinoma biopsies.
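The filtering step above amounts to dropping every gene whose binary calls are all 0. A minimal sketch, with hypothetical gene names and a made-up toy matrix, could look like this; the final lines simply re-check the gene counts quoted in the text.

```python
def drop_unmutated(calls_by_gene):
    """Keep only genes with a non-silent mutation (call == 1) in at least one sample."""
    return {gene: calls for gene, calls in calls_by_gene.items() if any(calls)}


# Toy binary matrix: three genes across three samples (made-up calls).
toy_calls = {
    "GENE_A": [0, 0, 0],   # never mutated: discarded
    "GENE_B": [1, 0, 0],
    "GENE_C": [0, 1, 1],
}
kept = drop_unmutated(toy_calls)
print(sorted(kept))

# Sanity check on the counts reported in the text.
total_genes, mutated_genes = 40542, 13584
print(total_genes - mutated_genes)  # 26958 genes discarded
```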
Let’s now look at the distribution of somatic mutations across the different kidney carcinomas. KICH patients had an average of 31 genes with non-silent mutations, against averages of 53 and 56 genes with non-silent mutations in KIRC and KIRP patients, respectively. Chromosomes 1, 2, 3, 11, 17 and 19 were the chromosomes with the most mutated genes on average.

5.3.1 Mutational signatures across Kidney cancers
Let’s now investigate the presence of any non-silent somatic mutation signature across the three kidney carcinomas.

While in KICH and KIRP patients the non-silent somatic mutations seem to be evenly distributed across the genome, KIRC patients clearly show three preferentially mutated genes, one on chromosome 2 and two on chromosome 3.
In KICH, ~32% of the patients (21 / 66) had a non-silent mutation in TP53, followed by 9% of patients (6 / 66) with a mutation in PTEN. As expected, KIRC patients show the most recurrently mutated genes, with ~47% of patients (170 / 365) carrying a mutation in VHL, ~40% (145 / 365) in PBRM1 and ~20% (72 / 365) in TTN. Similarly, 16% of KIRP patients (45 / 280) have non-silent somatic mutations in TTN.
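The percentages above follow directly from the reported patient counts. The short sketch below recomputes them from the numbers quoted in the text; only the dictionary layout is an assumption of this example.

```python
# (cancer type, gene) -> (patients with a non-silent mutation, patients in that cancer type)
reported = {
    ("KICH", "TP53"): (21, 66),
    ("KICH", "PTEN"): (6, 66),
    ("KIRC", "VHL"): (170, 365),
    ("KIRC", "PBRM1"): (145, 365),
    ("KIRC", "TTN"): (72, 365),
    ("KIRP", "TTN"): (45, 280),
}

for (cancer, gene), (mutated, total) in reported.items():
    print(f"{cancer} {gene}: {mutated}/{total} = {100 * mutated / total:.1f}%")

# The three cohort sizes add up to the 711 samples with somatic mutation data.
cohort_sizes = {"KICH": 66, "KIRC": 365, "KIRP": 280}
print(sum(cohort_sizes.values()))
```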

The table below reports all 19,573 somatic mutations detected across the biopsies of the three kidney carcinomas.
5.3.2 Dimensionality Reduction and Dataset Exploration
5.3.2.1 UMAP on somatic mutations data

5.3.2.2 PCA

The first 24 Principal Components capture more than 80% of the variance in the Kidney cancers transcriptomics dataset, with the first two components (PC1 and PC2) capturing a bit more than 25% of the variance.
When we project the samples onto PC1 and PC2, we can see that PC1 separates KIRC from KICH and KIRP, which instead cluster together. The second component, PC2, seems to partially separate KICH and KIRP samples.
We can also investigate other Principal Components, to see if there is a component that manages to fully resolve the three cancer types. PC4 seems to better separate KICH from KIRP, while PC1 can discriminate between KIRC and KIRP.


5.4 Epigenomics
Methylation information is available for 646 samples, covering 148,068 whitelisted probes.
Regardless of array type, both the 450k and EPIC arrays record two measurements for each CpG: a methylated intensity (M) and an unmethylated intensity (U). Using these values, the proportion of methylation at each CpG locus can be determined. The level of methylation at a locus is commonly reported as the Beta-value, i.e. the ratio of the methylated probe intensity to the overall intensity:
$$\beta = \frac{M}{M + U}$$
Illumina recommends adding a constant offset α (by default, α = 100) to the denominator, i.e. $\beta = M / (M + U + \alpha)$, to regularize the Beta-value when both methylated and unmethylated probe intensities are low. The Beta-value statistic results in a number between 0 and 1, or 0 and 100%. Under ideal conditions, a value of zero indicates that all copies of the CpG site in the sample were completely unmethylated (no methylated molecules were measured) and a value of one indicates that every copy of the site was methylated.
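To make the effect of the offset concrete, here is a minimal sketch of the Beta-value calculation with the default α = 100 described above. The function name `beta_value` and the toy intensities are illustrative assumptions.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Beta-value: methylated intensity over total intensity plus a constant offset."""
    return methylated / (methylated + unmethylated + offset)


# Strong signal: the offset barely changes the estimate (made-up intensities).
print(round(beta_value(3000, 1000), 4))            # with offset
print(round(beta_value(3000, 1000, offset=0), 4))  # without offset

# Weak signal: the offset pulls an unreliable ratio towards 0.
print(round(beta_value(30, 10), 4))
print(round(beta_value(30, 10, offset=0), 4))
```

With high intensities the two estimates are close, whereas with low intensities the offset strongly shrinks the value, which is the regularizing behaviour described above.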
5.4.1 Filtering & QC
We removed 6,364 probes with missing values (148,068 - 141,704 = 6,364).
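Removing probes with missing values can be sketched as a simple filter over a probe-by-sample table. The probe identifiers and Beta-values below are made up, and representing missing values as `None` or NaN is an assumption of this example.

```python
import math


def is_missing(value):
    """True for None or NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_probes_with_missing(betas):
    """Keep only probes that have a value in every sample.

    `betas` maps a probe id to its list of per-sample Beta-values.
    """
    return {
        probe: values
        for probe, values in betas.items()
        if not any(is_missing(v) for v in values)
    }


# Toy table: three hypothetical probes across three samples.
toy_betas = {
    "cg_toy_1": [0.10, 0.50, 0.90],
    "cg_toy_2": [0.20, None, 0.40],
    "cg_toy_3": [0.30, float("nan"), 0.70],
}
complete = drop_probes_with_missing(toy_betas)
print(list(complete), len(toy_betas) - len(complete), "probes removed")
```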

5.4.2 Dimensionality Reduction and Dataset Exploration
5.4.2.1 UMAP on methylation data
After we excluded biopsies from normal tissues and other tumors and filtered out the lowly expressed genes, the UMAP shows three clusters that are better refined than the ones depicted in Section 1.2.1, and that roughly correspond to the three different kidney cancer subtypes.

5.4.2.2 PCA

The first 24 Principal Components capture more than 80% of the variance in the Kidney cancers transcriptomics dataset, with the first two components (PC1 and PC2) capturing a bit more than 25% of the variance.
When we project the samples onto PC1 and PC2, we can see that PC1 separates KIRC from KICH and KIRP, which instead cluster together. The second component, PC2, seems to partially separate KICH and KIRP samples.

We can also investigate other Principal Components, to see if there is a component that manages to fully resolve the three cancer types. PC4 seems to better separate KICH from KIRP, while PC1 can discriminate between KIRC and KIRP.


- cg03830585 -> ITPR1: "ITPR1 protects renal cancer cells against natural killer cells by inducing autophagy".
- cg00868875 -> KCTD1.
- cg07037412 -> CACNA1H: "Voltage-gated calcium channels: Novel targets for cancer therapy". CACNA1H was downregulated in gastrointestinal stromal tumor, sarcoma and renal cancer. Notably, compared with our previous research, CACNA1H was specifically overexpressed relative to normal tissue samples in renal cancer, sarcoma and gastrointestinal stromal tumors.

5.4.3 Probe-Wise Differential Methylation

Summary of the differential methylation calls, from print(summary(dt)):

          KIRC_vs_KICH   KIRP_vs_KICH   KIRC_vs_KIRP
Down             56952          43222          69811
NotSig           45814          51748          42083
Up               45302          53098          36174
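As a quick consistency check on the summary above, the sketch below re-adds the three categories for each contrast and reports the share of probes called significant. The counts are copied from the summary; the dictionary layout and variable names are assumptions of this example.

```python
summary = {
    "KIRC_vs_KICH": {"Down": 56952, "NotSig": 45814, "Up": 45302},
    "KIRP_vs_KICH": {"Down": 43222, "NotSig": 51748, "Up": 53098},
    "KIRC_vs_KIRP": {"Down": 69811, "NotSig": 42083, "Up": 36174},
}

totals = {}
for contrast, counts in summary.items():
    totals[contrast] = sum(counts.values())
    significant = counts["Down"] + counts["Up"]
    share = 100 * significant / totals[contrast]
    print(f"{contrast}: {totals[contrast]} probes tested, {share:.1f}% significant")
```

Each contrast sums to 148,068 probes, matching the number of whitelisted probes reported at the start of the Epigenomics section.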


Tables of differentially methylated probes.